Preparation
Load all necessary packages.
library(tidyverse) # to ease data cleansing, use ggplot
library(ggfortify) # to assist ggplot in plotting time series object
library(lubridate) # to process datetime variables
library(forecast) # to perform time series modelling
library(MLmetrics) # to evaluate the models using MAE
library(TSstudio) # to plot the forecasted results
library(tseries) # to use adf.test (stationarity test)
Load the train dataset and view its first six rows.
fnb <- read.csv("data/data-train.csv")
head(fnb)
## transaction_date receipt_number item_id item_group item_major_group
## 1 2017-12-01T13:32:46Z A0026694 I10100139 noodle_dish food
## 2 2017-12-01T13:32:46Z A0026694 I10500037 drinks beverages
## 3 2017-12-01T13:33:39Z A0026695 I10500044 drinks beverages
## 4 2017-12-01T13:33:39Z A0026695 I10400009 side_dish food
## 5 2017-12-01T13:33:39Z A0026695 I10500046 drinks beverages
## 6 2017-12-01T13:35:59Z A0026696 I10300002 pastry food
## quantity price_usd total_usd payment_type sales_type
## 1 1 7.33 7.33 cash dine_in
## 2 1 4.12 4.12 cash dine_in
## 3 1 2.02 2.02 cash dine_in
## 4 1 5.60 5.60 cash dine_in
## 5 1 3.01 3.01 cash dine_in
## 6 1 4.86 4.86 cash dine_in
str(fnb)
## 'data.frame': 137748 obs. of 10 variables:
## $ transaction_date: Factor w/ 35477 levels "2017-12-01T13:32:46Z",..: 1 1 2 2 2 3 3 3 3 4 ...
## $ receipt_number : Factor w/ 35522 levels "A0025932","A0026536",..: 3 3 4 4 4 5 5 5 5 6 ...
## $ item_id : Factor w/ 264 levels "I10100001","I10100002",..: 52 162 165 130 167 108 46 197 160 232 ...
## $ item_group : Factor w/ 14 levels "coffee","dessert",..: 8 3 3 13 3 10 8 1 3 14 ...
## $ item_major_group: Factor w/ 4 levels "beverages","food",..: 2 1 1 2 1 2 2 1 1 4 ...
## $ quantity : int 1 1 1 1 1 1 1 1 1 2 ...
## $ price_usd : num 7.33 4.12 2.02 5.6 3.01 4.86 6.34 7.58 4.12 1.77 ...
## $ total_usd : num 7.33 4.12 2.02 5.6 3.01 4.86 6.34 7.58 4.12 3.54 ...
## $ payment_type : Factor w/ 2 levels "cash","non_cash": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales_type : Factor w/ 3 levels "delivery","dine_in",..: 2 2 2 2 2 2 2 2 2 2 ...
As stated earlier in the Aim section above, we’re going to perform time series analysis. Looking at the data frame above, we can see many unnecessary features: for time series forecasting, we only need two, i.e. the time and the variable we want to forecast (the number of visitors).
However, the data frame does not yet have the visitor feature we need, so we have to create it ourselves. Among the available variables, transaction_date contains the date-time values we will use as the time feature. To find a good candidate for the visitor feature, we have to look back carefully at the Content section. There, we can see that receipt_number is a good candidate, as it represents the ID of a single transaction, and every transaction corresponds to one customer. Looking closely at the data frame, several rows (item_id values) can share the same receipt_number: since item_id is the ID of an item within a transaction, one receipt_number can cover more than one item. In other words, every customer or visitor can order more than one food and/or drink. Therefore, we select receipt_number as the basis for the visitor feature.
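To see why counting unique receipt numbers gives the visitor count, here is a minimal base-R sketch on a hypothetical toy transaction table (made-up data, not the real dataset):

```r
# Toy transactions (hypothetical): receipt A001 has two items in hour 13,
# so it still counts as one visitor; A002 and A003 are single-item receipts.
toy <- data.frame(
  hour           = c(13, 13, 13, 14),
  receipt_number = c("A001", "A001", "A002", "A003")
)

# One unique receipt = one visitor, regardless of how many items it contains
visitors <- tapply(toy$receipt_number, toy$hour,
                   function(x) length(unique(x)))
visitors
# 13 14
#  2  1
```

The tidyverse pipeline used later (`group_by()` + `n_distinct()`) performs exactly this kind of distinct-count per hour.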
Now, let’s drop all unnecessary variables, keeping only transaction_date and receipt_number. Next, since the transaction_date column is not in the correct data type (it is a factor, as shown by str() above), we change it to date-time by using the ymd_hms() function from the lubridate library.
fnb <- fnb %>% select(transaction_date, receipt_number)
fnb$transaction_date <- ymd_hms(fnb$transaction_date)
head(fnb)
## transaction_date receipt_number
## 1 2017-12-01 13:32:46 A0026694
## 2 2017-12-01 13:32:46 A0026694
## 3 2017-12-01 13:33:39 A0026695
## 4 2017-12-01 13:33:39 A0026695
## 5 2017-12-01 13:33:39 A0026695
## 6 2017-12-01 13:35:59 A0026696
We’ve obtained the necessary columns. As the first 6 rows show, the date-time values are precise down to minutes and seconds. We need to round them down to the hour so that we can count the visitors for each hour. We can use the floor_date() function from lubridate with the unit parameter set to "hour". Afterwards, we group the rows by transaction_date and count the number of unique receipt_number values (since every single receipt_number represents one visitor). From these counts, we create a new column called visitor.
fnb <- fnb %>%
mutate(transaction_date = floor_date(x = .$transaction_date, unit = "hour")) %>%
group_by(transaction_date) %>%
summarise(visitor = n_distinct(receipt_number)) %>%
ungroup()
fnb
## # A tibble: 1,244 x 2
## transaction_date visitor
## <dttm> <int>
## 1 2017-12-01 13:00:00 16
## 2 2017-12-01 14:00:00 38
## 3 2017-12-01 15:00:00 27
## 4 2017-12-01 16:00:00 29
## 5 2017-12-01 17:00:00 44
## 6 2017-12-01 18:00:00 50
## 7 2017-12-01 19:00:00 66
## 8 2017-12-01 20:00:00 70
## 9 2017-12-01 21:00:00 63
## 10 2017-12-01 22:00:00 63
## # ... with 1,234 more rows
Next, looking at the data frame above, the transaction_date column does not cover the full 24 hours of each day. The irregularity starts right at the first hour, 13:00:00: sometimes a day begins at 10 AM, 9 AM, or 1 PM. We solve this by giving every day a full 24 hours, i.e. starting at 00:00 and ending at 23:00. This is called time series padding. We can do it with the pad() function from the padr library, as shown below.
min_date <- min(fnb$transaction_date) # define the first date
max_date <- max(fnb$transaction_date) # define the last date
# build the first date-time value. Since min_date starts at 1 PM, we replace its hour with "00:00:00"
start_val <- make_datetime(year = year(min_date), month = month(min_date), day = day(min_date), hour = hour(hms("00:00:00")))
# we make the last date-time value
end_val <- make_datetime(year = year(max_date), month = month(max_date), day = day(max_date), hour = hour(max_date))
# we perform time series padding and assign the results in a new variable called fnb_pad
fnb_pad <- padr::pad(x = fnb, start_val = start_val, end_val = end_val)
## pad applied on the interval: hour
# show the data frame
fnb_pad
## # A tibble: 1,920 x 2
## transaction_date visitor
## <dttm> <int>
## 1 2017-12-01 00:00:00 NA
## 2 2017-12-01 01:00:00 NA
## 3 2017-12-01 02:00:00 NA
## 4 2017-12-01 03:00:00 NA
## 5 2017-12-01 04:00:00 NA
## 6 2017-12-01 05:00:00 NA
## 7 2017-12-01 06:00:00 NA
## 8 2017-12-01 07:00:00 NA
## 9 2017-12-01 08:00:00 NA
## 10 2017-12-01 09:00:00 NA
## # ... with 1,910 more rows
From the data frame shown above, we can see a lot of missing values. To see them clearly, let’s count them.
is.na(fnb_pad) %>% colSums()
## transaction_date visitor
## 0 676
There are 676 missing values in the visitor column. This is unsurprising: the time series padding filled the newly created rows with NA. These missing values actually represent the absence of visitors, i.e. 0 visitors. Therefore, instead of leaving them as NA, we fill them with the value 0.
fnb_pad$visitor <- replace_na(fnb_pad$visitor, 0)
fnb_pad
## # A tibble: 1,920 x 2
## transaction_date visitor
## <dttm> <dbl>
## 1 2017-12-01 00:00:00 0
## 2 2017-12-01 01:00:00 0
## 3 2017-12-01 02:00:00 0
## 4 2017-12-01 03:00:00 0
## 5 2017-12-01 04:00:00 0
## 6 2017-12-01 05:00:00 0
## 7 2017-12-01 06:00:00 0
## 8 2017-12-01 07:00:00 0
## 9 2017-12-01 08:00:00 0
## 10 2017-12-01 09:00:00 0
## # ... with 1,910 more rows
is.na(fnb_pad) %>% colSums()
## transaction_date visitor
## 0 0
All NAs have been changed to 0. All clean. The dataset is now ready for exploratory data analysis.
Exploratory Data Analysis
Visualize the Dataset
In this chapter, we’re going to explore and understand the dataset more deeply. We can do that by visualizing the distribution of the number of visitors per hour and per day. To do that, we first need to make two new columns, i.e. the hour and day columns.
fnb_pad$hour <- hour(fnb_pad$transaction_date) # create the hour column
fnb_pad$day <- wday(x = fnb_pad$transaction_date, label = T, abbr = F) # create the day column
fnb_pad
## # A tibble: 1,920 x 4
## transaction_date visitor hour day
## <dttm> <dbl> <int> <ord>
## 1 2017-12-01 00:00:00 0 0 Friday
## 2 2017-12-01 01:00:00 0 1 Friday
## 3 2017-12-01 02:00:00 0 2 Friday
## 4 2017-12-01 03:00:00 0 3 Friday
## 5 2017-12-01 04:00:00 0 4 Friday
## 6 2017-12-01 05:00:00 0 5 Friday
## 7 2017-12-01 06:00:00 0 6 Friday
## 8 2017-12-01 07:00:00 0 7 Friday
## 9 2017-12-01 08:00:00 0 8 Friday
## 10 2017-12-01 09:00:00 0 9 Friday
## # ... with 1,910 more rows
Using the newly created columns together with the visitor column, we can plot the distribution of the dataset as shown below. We can also use the ggplotly() function from the plotly library to make the plot interactive.
plotly::ggplotly(fnb_pad %>% ggplot(aes(x = hour, y = visitor)) +
geom_col(aes(fill = day)) +
theme_minimal() +
labs(title = "Number of visitors",
x = "Hour",
y = "Number of visitors",
fill = "Day"))
As shown in the plot above, at a glance there are no visitors between 2 AM and 9 AM. However, as mentioned earlier in the Preparation section (just before the time series padding), there exists a day which starts at 9 AM. If we zoom in on the plot between 9 AM and 10 AM, that day becomes visible. Curiously, it has only one visitor.
In addition, the plot indicates that the later in the day, the more visitors come. This only applies up to 8 PM, however: after 8 PM, the number of visitors gradually decreases.
The proportion of visitors is approximately balanced across the days of the week. To make this clearer, let’s compare the days using boxplots.
fnb_pad %>% ggplot(aes(x = day,
y = visitor)) +
geom_boxplot(fill = "green") +
geom_jitter(aes(color = hour)) +
labs(title = "Boxplots of each day",
x = "Day",
y = "Number of visitors",
color = "Hour")

From the boxplots above, the weekend (Saturday and Sunday) has relatively more visitors than weekdays: the medians of Saturday and Sunday are clearly higher. From Monday to Thursday, the medians are roughly the same. Friday, however, is odd: it has the lowest median while its distribution is wider than those of the other weekdays. This indicates that Friday tends to follow a more random pattern.
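The median comparison behind this reading can be reproduced numerically; here is a base-R sketch on hypothetical toy data (on the real data, the same call would be made with fnb_pad instead of the toy frame):

```r
# Hypothetical mini-dataset mimicking fnb_pad's day/visitor columns
toy <- data.frame(
  day     = rep(c("Friday", "Saturday"), each = 4),
  visitor = c(0, 5, 60, 80, 20, 40, 70, 90)
)

# Median visitors per day of week: the statistic each boxplot's center line shows
med <- aggregate(visitor ~ day, data = toy, FUN = median)
med
#        day visitor
# 1   Friday    32.5
# 2 Saturday    55.0
```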
Visualize the Time Series Object (1 Seasonal) {#1seas}
Subsequently, we’re going to visualize the dataset as a time series object. To do this, we first create a time series object using the ts() function. The column of interest is visitor. Since the dataset consists of hourly observations, we set the frequency parameter to 24 (24 hours make up one day). Then, let’s plot the time series object.
fnb_ts <- ts(data = fnb_pad$visitor, frequency = 24)
autoplot(fnb_ts)

Now that the time series object has been created, let’s decompose it to see its components (i.e. seasonality, trend, and residual/remainder) using the decompose() function. Then, let’s plot the decomposed time series object.
fnb_dec <- decompose(x = fnb_ts, type = "additive")
fnb_dec %>% autoplot()

It can be seen above that the trend still contains patterns. This implicitly indicates that the time series object has more than one seasonality. We will deal with this issue a little later; for now, we continue with the 24-hour seasonality. Let’s plot it.
fnb_pad %>%
mutate(seasonal = fnb_dec$seasonal) %>% # create the seasonal column from decomposed ts object
distinct(hour, seasonal) %>% # only show the unique values
ggplot(mapping = aes(x = hour, y = seasonal)) +
geom_col() +
scale_x_continuous(breaks = seq(0,24,1)) +
labs(title = "Seasonality Pattern Analysis",
subtitle = "Hourly") +
theme(axis.text.y = element_blank()) # we're not interested in seasonal values

The plot above shows the 24-hour seasonality. Here, we use only one seasonality, from hours to a day. Since the seasonal values on the y-axis are not interpretable on their own, we hide them and focus only on the relative bar heights. What the plot tells us is that the seasonality describes a daily cycle: from 0 AM there is a significant decrease, and the values remain stable from 2 AM to 9 AM (with a slight increase at 9 AM). From 10 AM to 8 PM the values keep increasing, and they decrease afterwards. This matches what we found in the dataset visualization above.
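One property worth knowing about additive decompose() output: the seasonal figure is centered, so it sums to (approximately) zero over one full period, which is why only the shape of the bars carries meaning. A self-contained check on a synthetic hourly series (made-up data, purely illustrative):

```r
# Synthetic hourly series with a 24-hour sinusoidal cycle plus noise
set.seed(1)
x <- ts(50 + 10 * sin(2 * pi * (0:719) / 24) + rnorm(720), frequency = 24)

dec <- decompose(x, type = "additive")

# The seasonal component over one period sums to ~0 by construction
sum(dec$seasonal[1:24])
```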
Visualize the Time Series Object (2 Seasonals)
Next, since we’ve discovered that with one seasonality the trend still contains patterns, it is recommended to add another seasonality to the time series object. Our first seasonality is hourly within a day; the second will be daily within a week, namely 24 hours times 7 days (24 x 7).
The msts() function from the forecast library can capture more than one seasonality. As our seasonalities are hourly-within-daily and daily-within-weekly (24 and 24 x 7, respectively), we set the seasonal.periods parameter to 24 and 24 x 7. Then, to decompose the multiple-seasonality object, we use the mstl() function, also from the forecast library, instead of decompose(). Once the decomposed object has been created, we can plot it.
msts(data = fnb_pad$visitor, seasonal.periods = c(24, 24*7)) %>%
mstl() %>% # decompose the multiple seasonality time series object
autoplot() # plot the decomposed one

fnb_msts <- msts(data = fnb_pad$visitor, seasonal.periods = c(24, 24*7)) # save time series object
fnb_msts_dec <- mstl(fnb_msts) # save the decomposed one
Using the time series object with two seasonalities, we’re going to plot the first seasonality. We can do that by running the code below.
as.data.frame(fnb_msts_dec) %>% # change data type to dataframe as ggplot only accepts dataframe
mutate(hour = fnb_pad$hour) %>% # create the hour column
ggplot(mapping = aes(x = hour, y = Seasonal24)) + # Seasonal24 contains 1st seasonal
geom_col() +
labs(title = "Seasonality Pattern Analysis",
subtitle = "Hourly") +
theme(axis.text.y = element_blank()) # we're not interested in seasonal values

The plot above shows exactly the same pattern as the plot of the time series object with only one seasonality, so its interpretation is the same.
Subsequently, let’s plot the second seasonality.
as.data.frame(fnb_msts_dec) %>% # change data type to dataframe as ggplot only accepts dataframe
mutate(day = fnb_pad$day) %>% # create the day column
ggplot(mapping = aes(x = day, y = Seasonal168)) + # Seasonal168 contains 2nd seasonal (24*7 = 168)
geom_col() +
labs(title = "Seasonality Pattern Analysis",
subtitle = "Daily") +
theme(axis.text.y = element_blank()) # we're not interested in seasonal values

The plot above shows something interesting. It is completely different from the first seasonality plot and rather resembles the boxplots we created earlier in the dataset visualization. What can be read from it is that the weekend tends to have notably higher values than the weekdays.
Now, let’s combine both seasonalities and plot them together.
# double seasonality plotted together
as.data.frame(fnb_msts_dec) %>% # change data type to dataframe as ggplot only accepts dataframe
mutate(day = fnb_pad$day, # create the day column
hour = fnb_pad$hour) %>% # create the hour column
group_by(day, hour) %>% # group the data frame by day and hour
summarise(seasonal = sum(Seasonal24 + Seasonal168)) %>% # then create new column called seasonal whose content is the combination between 1st and 2nd seasonality
ggplot(mapping = aes(x = hour, y = seasonal)) +
geom_col(aes(fill = day)) +
scale_fill_viridis_d(option = "plasma") +
labs(title = "Seasonality Pattern Analysis",
subtitle = "Hourly and Daily") +
theme(axis.text.y = element_blank()) # we're not interested in seasonal values

The plot above is similar to the first seasonality plot, but with additional daily information. Its interpretation is therefore similar as well, combined with the interpretation of the second seasonality plot.
Modelling and Evaluation
We’ve explored the dataset and its time series object. Now, let’s create the models and evaluate them. As explained earlier, we have two types of time series object: one with a single seasonality and one with two seasonalities. We’re going to examine both, starting with the single-seasonality object.
Modelling of One Seasonality
First, we should check the assumptions. For time series, there are usually three: normality of residuals, no autocorrelation, and stationarity. However, since we’re going to employ triple exponential smoothing (Holt-Winters) and ARIMA/SARIMA, and as mentioned here and here, we won’t check the normality of residuals. From now on we will only use the no-autocorrelation and stationarity assumptions.
Now, let’s check whether autocorrelation exists. We can use the Box.test() function. We expect to have p-value greater than 0.05.
Box.test(fnb_dec$random)
##
## Box-Pierce test
##
## data: fnb_dec$random
## X-squared = 314.38, df = 1, p-value < 2.2e-16
The p-value is far below 0.05, so the residuals still contain autocorrelation, contrary to our expectation. Next, let’s check the stationarity.
adf.test(x = fnb_ts)
## Warning in adf.test(x = fnb_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: fnb_ts
## Dickey-Fuller = -20.952, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
Now, let’s split the time series object into two parts: the train data and the validation data. We don’t use the term test data here, as we reserve it for the final test. Since the aim of this article is to forecast 7 days, we create validation data with a length of 7 days as well (24 x 7).
# split train-validation 1 seasonal
train <- head(fnb_ts, length(fnb_ts) - 24 * 7)
validation <- tail(fnb_ts, 24 * 7)
Now, let’s create the model using triple exponential smoothing. We can use the HoltWinters() and ets() functions.
model_holt<- HoltWinters(train, seasonal = "additive")
model_ets <- ets(train, model = "ZZZ")
Now, let’s evaluate both models by forecasting the validation data and check their performance in MAE (Mean Absolute Error). Then, plot the results.
forecast_holt <- forecast(model_holt, h = 24 * 7) # model HoltWinters
forecast_ets <- forecast(model_ets, h = 24 * 7) # model ets
# check MAE
MAEoneHolt <- MAE(y_pred = forecast_holt$mean, y_true = validation)
MAEoneHolt
## [1] 7.187944
MAEoneETS <- MAE(y_pred = forecast_ets$mean, y_true = validation)
MAEoneETS
## [1] 5.83201
# plot the results
plot_forecast(forecast_holt)
plot_forecast(forecast_ets)
We also can plot the results by including the train data.
test_forecast(actual = fnb_ts, forecast.obj = forecast_holt,
train = train, test = validation)
test_forecast(actual = fnb_ts, forecast.obj = forecast_ets,
train = train, test = validation)
We found that the model built with the ets() function performs better in terms of MAE. Looking at the plots, the model built with the HoltWinters() function produces forecasts that keep increasing, whereas the ets() model tends to stay flat.
Now, let’s try another model: ARIMA/SARIMA. We can use the auto.arima() function (from the forecast library) to find the best model automatically. However, before employing ARIMA/SARIMA, we have to ensure that our data is stationary. We check this with the adf.test() function, expecting a p-value below 0.05.
adf.test(fnb_ts)
## Warning in adf.test(fnb_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: fnb_ts
## Dickey-Fuller = -20.952, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
The p-value is below 0.05, so the series is stationary. Now, let’s build the model.
model_arima <- auto.arima(train)
Then, let’s see the model’s summary.
summary(model_arima)
## Series: train
## ARIMA(1,0,2)(2,1,0)[24]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sar2
## 0.7905 -0.3247 -0.0638 -0.6011 -0.3075
## s.e. 0.0343 0.0429 0.0321 0.0238 0.0234
##
## sigma^2 estimated as 42.41: log likelihood=-5692.67
## AIC=11397.35 AICc=11397.39 BIC=11430.07
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.007168512 6.458447 4.240828 NaN Inf 0.7730118 -0.003648501
In terms of training MAE, the model performs well. Now, let’s use it to forecast the validation data.
forecast_arima <- forecast(model_arima, h = length(validation))
MAEoneARIMA <- MAE(y_pred = forecast_arima$mean, y_true = validation)
MAEoneARIMA
## [1] 6.827283
plot_forecast(forecast_arima)
# ARIMA(2,0,2)(0,1,0)[168]
test_forecast(actual = fnb_ts, forecast.obj = forecast_arima,
train = train, test = validation)
Even though the SARIMA model yields a slightly worse MAE, its plot shows that it follows the pattern of the validation data better than the two earlier models.
Modelling of Two Seasonalities
As explained earlier, since we have two types of time series object, we also consider two types of model: one with a single seasonality and one with multiple seasonalities. We have covered the single-seasonality models; now let’s discuss the second type. As the multiple-seasonality time series object has already been created, let’s split the data.
Here, we apply the same considerations as for the single-seasonality split.
# split train-validation 2 seasonal
train_2 <- head(fnb_msts, length(fnb_msts) - 24 * 7)
validation_2 <- tail(fnb_msts, 24 * 7)
First, we’re going to create the model using triple exponential smoothing approach.
model_holt_msts<- HoltWinters(train_2)
forecast_holt2 <- forecast(model_holt_msts, h = 24 * 7)
MAEtwoHolt <- MAE(y_pred = forecast_holt2$mean, y_true = validation_2)
MAEtwoHolt
## [1] 4.346336
plot_forecast(forecast_holt2)
test_forecast(actual = fnb_msts, forecast.obj = forecast_holt2,
train = train_2, test = validation_2)
The Holt-Winters model with multiple seasonalities performs better. Its plot also shows a significant improvement compared to the single-seasonality version.
Compare All Models
We have four models so far. Let’s compare them in terms of MAE.
data.frame(Model = c("HoltWinters one seasonal",
"ets one seasonal",
"SARIMA one seasonal",
"HoltWinters two seasonal"),
MAE = c(MAEoneHolt,
MAEoneETS,
MAEoneARIMA,
MAEtwoHolt))
## Model MAE
## 1 HoltWinters one seasonal 7.187944
## 2 ets one seasonal 5.832010
## 3 SARIMA one seasonal 6.827283
## 4 HoltWinters two seasonal 4.346336
It can be seen that the model using the HoltWinters() function with two seasonalities outperforms the others. Therefore, we’re going to use it as our final model. Let’s retrain the final model on all available train data; afterwards, we will forecast the test dataset.
model_final <- HoltWinters(fnb_msts, seasonal = "additive")
forecast_test <- forecast(model_final, h = 24*7)
Now, let’s fill in the test dataset. Firstly, let’s take a quick look at it.
fnb_test <- read.csv("data/data-test.csv")
fnb_test
## datetime visitor
## 1 2018-02-19T10:00:00Z NA
## 2 2018-02-19T11:00:00Z NA
## 3 2018-02-19T12:00:00Z NA
## 4 2018-02-19T13:00:00Z NA
## 5 2018-02-19T14:00:00Z NA
## 6 2018-02-19T15:00:00Z NA
## 7 2018-02-19T16:00:00Z NA
## 8 2018-02-19T17:00:00Z NA
## 9 2018-02-19T18:00:00Z NA
## 10 2018-02-19T19:00:00Z NA
## 11 2018-02-19T20:00:00Z NA
## 12 2018-02-19T21:00:00Z NA
## 13 2018-02-19T22:00:00Z NA
## 14 2018-02-20T10:00:00Z NA
## 15 2018-02-20T11:00:00Z NA
## 16 2018-02-20T12:00:00Z NA
## 17 2018-02-20T13:00:00Z NA
## 18 2018-02-20T14:00:00Z NA
## 19 2018-02-20T15:00:00Z NA
## 20 2018-02-20T16:00:00Z NA
## 21 2018-02-20T17:00:00Z NA
## 22 2018-02-20T18:00:00Z NA
## 23 2018-02-20T19:00:00Z NA
## 24 2018-02-20T20:00:00Z NA
## 25 2018-02-20T21:00:00Z NA
## 26 2018-02-20T22:00:00Z NA
## 27 2018-02-21T10:00:00Z NA
## 28 2018-02-21T11:00:00Z NA
## 29 2018-02-21T12:00:00Z NA
## 30 2018-02-21T13:00:00Z NA
## 31 2018-02-21T14:00:00Z NA
## 32 2018-02-21T15:00:00Z NA
## 33 2018-02-21T16:00:00Z NA
## 34 2018-02-21T17:00:00Z NA
## 35 2018-02-21T18:00:00Z NA
## 36 2018-02-21T19:00:00Z NA
## 37 2018-02-21T20:00:00Z NA
## 38 2018-02-21T21:00:00Z NA
## 39 2018-02-21T22:00:00Z NA
## 40 2018-02-22T10:00:00Z NA
## 41 2018-02-22T11:00:00Z NA
## 42 2018-02-22T12:00:00Z NA
## 43 2018-02-22T13:00:00Z NA
## 44 2018-02-22T14:00:00Z NA
## 45 2018-02-22T15:00:00Z NA
## 46 2018-02-22T16:00:00Z NA
## 47 2018-02-22T17:00:00Z NA
## 48 2018-02-22T18:00:00Z NA
## 49 2018-02-22T19:00:00Z NA
## 50 2018-02-22T20:00:00Z NA
## 51 2018-02-22T21:00:00Z NA
## 52 2018-02-22T22:00:00Z NA
## 53 2018-02-23T10:00:00Z NA
## 54 2018-02-23T11:00:00Z NA
## 55 2018-02-23T12:00:00Z NA
## 56 2018-02-23T13:00:00Z NA
## 57 2018-02-23T14:00:00Z NA
## 58 2018-02-23T15:00:00Z NA
## 59 2018-02-23T16:00:00Z NA
## 60 2018-02-23T17:00:00Z NA
## 61 2018-02-23T18:00:00Z NA
## 62 2018-02-23T19:00:00Z NA
## 63 2018-02-23T20:00:00Z NA
## 64 2018-02-23T21:00:00Z NA
## 65 2018-02-23T22:00:00Z NA
## 66 2018-02-24T10:00:00Z NA
## 67 2018-02-24T11:00:00Z NA
## 68 2018-02-24T12:00:00Z NA
## 69 2018-02-24T13:00:00Z NA
## 70 2018-02-24T14:00:00Z NA
## 71 2018-02-24T15:00:00Z NA
## 72 2018-02-24T16:00:00Z NA
## 73 2018-02-24T17:00:00Z NA
## 74 2018-02-24T18:00:00Z NA
## 75 2018-02-24T19:00:00Z NA
## 76 2018-02-24T20:00:00Z NA
## 77 2018-02-24T21:00:00Z NA
## 78 2018-02-24T22:00:00Z NA
## 79 2018-02-25T10:00:00Z NA
## 80 2018-02-25T11:00:00Z NA
## 81 2018-02-25T12:00:00Z NA
## 82 2018-02-25T13:00:00Z NA
## 83 2018-02-25T14:00:00Z NA
## 84 2018-02-25T15:00:00Z NA
## 85 2018-02-25T16:00:00Z NA
## 86 2018-02-25T17:00:00Z NA
## 87 2018-02-25T18:00:00Z NA
## 88 2018-02-25T19:00:00Z NA
## 89 2018-02-25T20:00:00Z NA
## 90 2018-02-25T21:00:00Z NA
## 91 2018-02-25T22:00:00Z NA
It can be seen from the data frame above that the test dataset only contains hours from 10 AM to 10 PM each day. Therefore, we have to subset our forecast so that it fits the test dataset.
OpenTimeIdx <- grep("^10$", (rep(0:23, 7))) # create the index for each 10 AM every day
CloseTimeIdx <- grep("^22$", (rep(0:23, 7))) # create the index for each 10 PM every day
# create new column as visitor column by filtering using the indices defined above
testOutput <- forecast_test$mean[c(OpenTimeIdx[1]:CloseTimeIdx[1],
OpenTimeIdx[2]:CloseTimeIdx[2],
OpenTimeIdx[3]:CloseTimeIdx[3],
OpenTimeIdx[4]:CloseTimeIdx[4],
OpenTimeIdx[5]:CloseTimeIdx[5],
OpenTimeIdx[6]:CloseTimeIdx[6],
OpenTimeIdx[7]:CloseTimeIdx[7])]
fnb_test$visitor <- testOutput # fill in the visitor column
write.csv(fnb_test, file = "data/testFinal.csv") # save the dataset as csv
After submitting the dataset above to the leaderboard, below is the result.

Its MAE is 5.64 and its RMSE is 7.71.
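For reference, the two leaderboard metrics are straightforward to compute by hand. A minimal base-R sketch (the vectors below are made-up numbers, not the leaderboard data):

```r
# Hand-rolled equivalents of the MAE and RMSE metrics
mae  <- function(pred, actual) mean(abs(pred - actual))
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

pred   <- c(10, 20, 30)  # made-up forecasts
actual <- c(12, 18, 33)  # made-up actuals

mae(pred, actual)   # (2 + 2 + 3) / 3 = 2.333...
rmse(pred, actual)  # sqrt((4 + 4 + 9) / 3) = 2.380...
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE, which is why RMSE is the larger of the two here.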